Search indexing with storage #5854

davidfischer · 2019-06-27T00:12:37Z

Changes

- Indexes HTMLFiles and ImportedFiles from storage rather than from local disk
- Reads intersphinx data from storage (over HTTP) rather than local disk
This change (as currently written) would begin to require settings.RTD_BUILD_MEDIA_STORAGE. This has implications for RTD corporate as well as development.

Note: "storage" could be filesystem-backed storage rather than remote cloud storage; either way, this uses the Django storage abstraction.
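
To illustrate the idea of indexing through a storage abstraction, here is a minimal sketch of a recursive `walk` built only on a `listdir(path)` call that returns `(directories, files)`, which is the shape Django's `Storage.listdir` uses. The `InMemoryStorage` class and the example paths are hypothetical stand-ins for illustration; this is not the code in this PR.

```python
import posixpath


class InMemoryStorage:
    """Hypothetical stand-in for a storage backend (illustration only)."""

    def __init__(self, files):
        # Map of storage path -> file contents
        self.files = dict(files)

    def listdir(self, path):
        """Return (directories, files) directly under ``path``."""
        prefix = path.rstrip('/') + '/' if path else ''
        dirs, names = set(), set()
        for key in self.files:
            if not key.startswith(prefix):
                continue
            head, _, rest = key[len(prefix):].partition('/')
            if rest:
                # Something deeper lives under ``head``, so it is a directory
                dirs.add(head)
            else:
                names.add(head)
        return sorted(dirs), sorted(names)

    def walk(self, top):
        """Yield (root, dirs, files) for ``top`` and every directory below it."""
        dirs, names = self.listdir(top)
        yield top, dirs, names
        for dirname in dirs:
            yield from self.walk(posixpath.join(top, dirname))


storage = InMemoryStorage({
    'html/proj/latest/index.html': b'',
    'html/proj/latest/api/index.html': b'',
    'html/proj/latest/_static/app.js': b'',
})
for root, __, filenames in storage.walk('html/proj/latest'):
    print(root, filenames)
```

Because the caller only depends on `listdir`, the same loop would work whether the backend is a local filesystem or a remote object store, at the cost of one listing request per directory on remote backends.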

stsewd

Nice! I just took a quick look and it looks good.

davidfischer · 2019-06-27T03:54:39Z

There are going to be some performance implications of this when storage is cloud storage as opposed to local storage. If, however, we stop doing any processing on non-HTML files, the performance difference probably won't matter.

ericholscher

Looks like a good approach, and would benefit from a few cleanup/speedup ideas I've been meaning to implement. I need to test this locally a bit to understand it fully. Really excited to get this working.

ericholscher · 2019-06-27T23:01:10Z

readthedocs/projects/models.py

+                type_='json', version_slug=self.version.slug, include_file=False
+            )
+            try:
+                for fjson_path in fjson_paths:


I believe we should be able to get away from this soon. We should be starting to store the proper path of a file after readthedocs/readthedocs-sphinx-ext#62 is merged.

ericholscher · 2019-06-27T23:04:55Z

readthedocs/projects/tasks.py

+    storage_path = version.project.get_storage_path(
+        type_='html', version_slug=version.slug, include_file=False
+    )
+    for root, __, filenames in storage.walk(storage_path):
        for filename in filenames:
            if filename.endswith('.html'):
                model_class = HTMLFile
            else:


I'd like to add the elif project.cdn_enabled here with an else: continue. 👍

davidfischer · 2019-07-08

What does project.cdn_enabled do exactly and what would it do differently for search?

ericholscher · 2019-07-09

It means we store all ImportedFiles, not just HTMLFiles, so we can purge them from the CDN properly when they change.
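
The branching discussed in this thread could look roughly like the sketch below. The function name `pick_model_name` is hypothetical, and it returns model names as strings rather than the real model classes so the example stays self-contained; `None` plays the role of the `else: continue` branch.

```python
def pick_model_name(filename, cdn_enabled):
    """Return the name of the model to store for ``filename``, or None to skip it."""
    if filename.endswith('.html'):
        # HTML files are always stored because they are what search indexes
        return 'HTMLFile'
    if cdn_enabled:
        # With a CDN, every file is tracked so it can be purged when it changes
        return 'ImportedFile'
    # Otherwise non-HTML files need no processing at all
    return None


for name in ['index.html', 'app.js']:
    print(name, pick_model_name(name, cdn_enabled=False), pick_model_name(name, cdn_enabled=True))
```

Skipping non-HTML files when no CDN is involved is also what would keep the cost of walking remote storage low, since no per-file work happens for them.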

codecov · 2019-07-15T22:30:18Z

Codecov Report

❗ No coverage uploaded for pull request base (master@7be01d8).
The diff coverage is 34.78%.

@@            Coverage Diff            @@
##             master    #5854   +/-   ##
=========================================
  Coverage          ?   79.26%           
=========================================
  Files             ?      175           
  Lines             ?    10826           
  Branches          ?     1350           
=========================================
  Hits              ?     8581           
  Misses            ?     1891           
  Partials          ?      354

Impacted Files	Coverage Δ
readthedocs/projects/models.py	`82.49% <0%> (ø)`
readthedocs/projects/tasks.py	`64.96% <25%> (ø)`
readthedocs/search/parse_json.py	`71.42% <60%> (ø)`
readthedocs/builds/storage.py	`80.35% <78.57%> (ø)`

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7be01d8...bdc84bd.

davidfischer · 2019-07-16T08:47:25Z

This should be good for a full review. The test failure seems to be in master and is unrelated.

ericholscher

This looks good 🎉. In general, this should just work locally and in prod, right? Locally and on .com it will just use the filesystem, but in prod we'll be hitting backend storage, which might cause some performance issues?

This is going to conflict with some of the logic we added in the search updates around GSoC, so I'm going to wait to merge this until after that is merged. It shouldn't be too much; we just had to change a bit of the logic around the ordering of creating HTMLFiles and SphinxDomains and indexing them.

ericholscher · 2019-07-16T17:15:54Z

readthedocs/projects/tasks.py

    :param commit: Commit that updated path
    :param build: Build id
    """

+    if not settings.RTD_BUILD_MEDIA_STORAGE:
+        return


We should likely log something here, so we don't get confused if indexing isn't working.
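
A minimal sketch of what such a guard with logging might look like. The helper name `should_index`, the log wording, and the use of `SimpleNamespace` as a stand-in for Django settings are all assumptions made for illustration, not the code that was merged.

```python
import logging
from types import SimpleNamespace

log = logging.getLogger(__name__)


def should_index(settings, version_slug):
    """Return False (and say why) when no build media storage is configured."""
    if not getattr(settings, 'RTD_BUILD_MEDIA_STORAGE', None):
        # Leave a trace so a silently empty search index is easy to diagnose
        log.info(
            'RTD_BUILD_MEDIA_STORAGE is not set; skipping search indexing. version=%s',
            version_slug,
        )
        return False
    return True


# Hypothetical settings objects standing in for django.conf.settings
unset = SimpleNamespace(RTD_BUILD_MEDIA_STORAGE=None)
configured = SimpleNamespace(RTD_BUILD_MEDIA_STORAGE='path.to.SomeStorageClass')
print(should_index(unset, 'latest'), should_index(configured, 'latest'))
```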

ericholscher · 2019-07-16T17:15:56Z

readthedocs/settings/base.py

@@ -234,7 +234,7 @@ def USE_PROMOS(self):  # noqa

    # Optional Django Storage subclass used to write build artifacts to cloud or local storage
    # https://docs.readthedocs.io/en/stable/settings.html#build-media-storage


This link needs to be updated, and the default changed.

davidfischer · 2019-07-18T19:32:30Z

I think this is good for a full review. For .org and for development, this should just work without issue. For .com, I think settings.RTD_BUILD_MEDIA_STORAGE needs to be unset on the build machines (they shouldn't copy files anywhere, the syncers will do that) and should be the default BuildMediaFileSystemStorage on the web machines where indexing happens.

ericholscher

Excited to get this shipped and delete JSON from our disks. 🎉

Should be good to merge w/ conflicts fixed.

davidfischer · 2019-08-07T21:30:56Z

It looks like some of the test directories changed in the last week and that's causing the test failures. I'll get them fixed up.

davidfischer · 2019-08-07T22:18:37Z

Ok, I believe the tests should pass. The added tests did actually uncover a bug in this implementation. The paths generated were not quite right.

ericholscher · 2019-08-08T17:35:44Z

Great, I'll get this shipped today 👍

Initial pass at search indexing with storage

596bed4

davidfischer added the "PR: work in progress" label Jun 27, 2019

davidfischer requested a review from ericholscher June 27, 2019 00:12

stsewd reviewed Jun 27, 2019

ericholscher reviewed Jun 27, 2019

Merge branch 'master' into davidfischer/search-indexing-with-storage

a909b36

davidfischer added 2 commits July 15, 2019 16:57

Handle imported files correctly whether a CDN is enabled

ea54d0f

Add tests for the build media storage

a29540e

davidfischer removed the "PR: work in progress" label Jul 16, 2019

ericholscher reviewed Jul 16, 2019

davidfischer added 3 commits July 17, 2019 16:22

Update docs and comments on BuildMediaStorage

62863bf

Updates based on feedback

731f8e6

Merge branch 'master' into davidfischer/search-indexing-with-storage

bdc84bd

Ignore result order in walk

a3ef36e

ericholscher approved these changes Jul 31, 2019

ericholscher added 2 commits August 7, 2019 12:21

Merge branch 'master' into davidfischer/search-indexing-with-storage

207899e

fix test file

7441b9f

Fixes for storage/importing

f344eeb

- Fixed a bug involving the path on imported files
- Fixed tests to account for extra files in test dir

ericholscher merged commit b997f31 into master Aug 8, 2019

ericholscher deleted the davidfischer/search-indexing-with-storage branch August 8, 2019 17:35
